Comparing the stability of IRT-based and non-IRT-based DIF methods in different cultural contexts using TIMSS data

Richard Bertrand, Université Laval
Nancy Boiteau, Université Laval



Summary 

A differential item functioning (DIF) analysis can be costly and time-consuming, especially when two or more
DIF detection methods are used just to be sure that the «right» DIF items have been identified. This paper looks
for criteria, such as within-method stability rates or between-method agreement rates, that could help in choosing
a powerful and low-cost DIF detection method.

Boiteau, Bertrand, Frenette & Saint-Onge (2002) showed that in similar cultural contexts (French Canadians,
English Canadians), IRT-based DIF procedures were somewhat more stable than non-IRT-based procedures
from one linguistic group (English) to another (French). In the present study we tried to verify this greater
within-method stability of IRT-based over non-IRT-based procedures in two different cultural contexts. TIMSS95
and TIMSS99 data were used to see whether the items identified as having translation DIF in 1995 between the
USA (reference group) and Japan (focal group) were the same in 1999. Four procedures were used for that
purpose: two IRT-based procedures, the UPD index (Shepard, Camilli & Williams, 1984; Camilli & Shepard, 1994)
and the NCDIF index proposed by Raju, van der Linden & Fleer (1995); and two non-IRT-based procedures, the
Mantel-Haenszel (MH) approach (Holland & Thayer, 1988) and the logistic regression (LR) method (Clauser &
Mazor, 1998). In each case, absolute and relative criteria were used to classify the strength of DIF. The absolute
criteria are those proposed by Zieky (1993) for MH, by Gierl, Rogers & Klinger (1999) for LR, by Boiteau,
Bertrand, Frenette & Saint-Onge (2002) for the UPD index and by Raju, van der Linden & Fleer (1995) for the
NCDIF index. The relative criteria are based on the outlier detection rule of the box-and-whiskers diagram
(Tukey, 1977). This paper also investigated between-method agreement rates. Results show that non-IRT-based
methods, and especially Mantel-Haenszel (a low-cost method), had between-method agreement rates as high as
those obtained by IRT-based methods. Also, the stability rates of non-IRT-based methods were found to be very
close to those of IRT-based methods. This last result challenges the one obtained by Boiteau, Bertrand,
Frenette & Saint-Onge (2002) and Boiteau & Bertrand (in press), who found IRT-based procedures somewhat
more stable.



Test translation/adaptation issue 



Globalization has also reached the assessment arena (O’Leary, 2002). More and more tests must now be
translated/adapted from one language/culture to another. Many authors have raised the issue of the lack of
measurement equivalence of such tests (Allalouf, 2003; Hambleton, 1993; Poortinga, 1995; Sireci, 1997). The
International Test Commission developed guidelines to address this very important issue (Arnold & Matus, 2000;
Hambleton, 2001). International large-scale assessments like the Third International Mathematics and Science
Study (TIMSS) or the Program for International Student Assessment (PISA), which involve tests translated into
many languages, must take this problem very seriously.



Problems of different types are associated with this translation/adaptation process in large-scale settings. Among
them is the credibility of comparisons involving countries that differ both linguistically and culturally
(Hambleton, 1993; O’Leary, 2002; Sireci, 1996; Wainer, 1994). To give these comparisons more credibility, a very
rigorous translation process must be followed. But even then, it would be rash to suppose that the source and the
target instruments are necessarily equivalent (Ercikan, 1999; Poortinga, 1995; Van de Vijver & Hambleton, 1996).
In fact, most of the time this process produces what can be called translation bias. As Hambleton (2001) noted,
judgmental reviews are not enough; empirical studies must be undertaken to identify and control for translation
bias.

In the last decades, quite a few statistical procedures have been proposed to identify translation bias. Van de
Vijver & Leung (1997) developed a three-way classification of potential sources of bias: construct bias, method
bias and item bias. Several statistical procedures were proposed to detect construct bias (Van de Vijver & Leung,
1997; Gierl, 2000; Bertrand et al., 2001) and method bias (Hambleton, 2001; Bertrand et al., 2001), but most of
the statistical procedures were developed to control item bias: the so-called differential item functioning (DIF)
procedures. These procedures do not always agree: some are more liberal, some are more conservative; some are
«cheap», some are expensive. How should one choose between them? The present authors think that
between-method agreement rates and within-method stability rates should be considered for that purpose.

The main purpose of this paper is to compare the between-method agreement rates and within-method stability
rates of IRT-based and non-IRT-based DIF detection methods using the 1995 and the 1999 TIMSS math
assessments. Boiteau, Bertrand, Frenette & Saint-Onge (2002) showed that in similar cultural contexts (French
Canadians, English Canadians), IRT-based DIF procedures were somewhat more stable than non-IRT-based
procedures. They also found fair between-method agreement rates. But, as noted by Candell & Hulin (1986),
Hambleton & Kanjee (1995) and Hambleton (2001), cultural distance must be taken into account in DIF studies.
Therefore, this paper aims at verifying the between-method agreement rates and the within-method stability rates
of IRT-based versus non-IRT-based procedures in two very different cultural contexts (Japan and the USA) using
the 1995 and 1999 TIMSS math data sets.



Procedures 

Only the 48 TIMSS math items common to the 1995 and the 1999 assessments were used in this study. These 48
items are spread across three of the TIMSS booklets (1, 5 and 7). For the 1995 assessment we ended up, for each
booklet and each group, with samples of about 1300 students; for the 1999 assessment we had, for each booklet,
about 1100 students for the reference group and 600 students for the focal group.

The dimensionality of the scales was tested using full information item factor analysis (Bock, Gibbons &
Muraki, 1988) as implemented in TESTFACT4 (Bock, Gibbons, Schilling, Muraki, Wilson & Wood, 2003). A
TESTFACT analysis was performed for each of the three booklets and each of the linguistic groups.

Next, we decided not to use a multistage procedure to purify¹ the internal criterion (ability), partly because this
procedure was too costly and time-consuming. Besides, while some authors (Navas-Ara & Gomez-Benito, 2002;
Zenisky, Hambleton & Robin, 2003) argue in favor of this procedure, others (Gierl, Jodoin & Ackerman, 2000)
are not convinced.

Since the statistical tests used in the DIF detection methods are affected by sample size, we decided to use a
relative criterion, in addition to the absolute criteria described below, to identify and classify DIF items. To
this end, the box-and-whiskers plot (see Figure 1) was used to examine the outlier and extreme values of the
statistic involved (UPD, NCDIF, |Δ|, χ²). An extreme value on the plot (located more than 3 times the
interquartile range above the 3rd quartile) indicated severe DIF (category C). An outlier (not extreme) value
(located more than 1.5 times but less than 3 times the interquartile range above the 3rd quartile) indicated
moderate DIF (category B). Otherwise, trivial or negligible DIF was assumed (category A).

¹ Preliminary results using a two-stage purifying procedure for non-IRT-based methods were very consistent with
the results obtained with the procedure chosen here (not using a purified internal criterion).
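
The relative criterion can be illustrated with a minimal Python sketch. It assumes that quartiles are computed with the "inclusive" interpolation of the standard library (the paper does not say which quartile convention its plotting software used), and the NCDIF values are invented for illustration only.

```python
from statistics import quantiles

def classify_by_boxplot(values):
    """Label each DIF statistic A, B or C from the upper box-plot fences."""
    # Quartile convention is an assumption; other conventions shift the fences slightly.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    labels = []
    for v in values:
        if v > q3 + 3.0 * iqr:      # extreme value: severe DIF
            labels.append("C")
        elif v > q3 + 1.5 * iqr:    # outlier, not extreme: moderate DIF
            labels.append("B")
        else:                       # inside the fences: trivial DIF
            labels.append("A")
    return labels

# Hypothetical NCDIF values for ten items (not taken from the TIMSS data).
ncdif = [0.001, 0.002, 0.002, 0.003, 0.003, 0.004, 0.004, 0.005, 0.009, 0.020]
labels = classify_by_boxplot(ncdif)
```

Because the fences depend on the interquartile range of the statistic itself, the same item value can be flagged in one booklet and not in another.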



Insert Figure 1 about here 



Non-IRT-based DIF methods
Mantel-Haenszel 

Holland and Thayer (1986) proposed using a statistic previously discussed by Mantel and Haenszel (1959) as the
basis of a method for detecting DIF. Over the years, this method has become more and more popular (Zwick, 1997).
The Mantel-Haenszel method compares, for a given item, the probability of obtaining a right answer in the focal
group to the probability of obtaining a right answer in the reference group for subjects of equal ability.



There are many ways to determine the presence of DIF using the $\alpha_{MH}$ statistic. The one used in the
present paper has become a favorite among users of the Mantel-Haenszel method (Roussos et al., 1999). It allows
for a more complete interpretation of DIF items. First, the value $\Delta_{MH} = -2.35 \ln(\alpha_{MH})$ is
obtained. Negative $\Delta_{MH}$ values correspond to items favoring the reference group. According to Zieky
(1993), if the absolute value of $\Delta_{MH}$ is higher than 1.5 and significantly higher than 1 (at α = .05),
the item is classified as category C (severe DIF). If the absolute value of $\Delta_{MH}$ is lower than 1 or not
significantly higher than 0 (at α = .05), the item is classified as category A (trivial DIF). In all other
situations, the item is classified as category B (moderate DIF).
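
As an illustration, the computation can be sketched in Python. The sketch assumes the usual Mantel-Haenszel common odds ratio over score strata, with one 2×2 table per stratum; the counts are invented, and the two significance tests are passed in as booleans because the paper does not give the standard error formula it used.

```python
import math

def mh_alpha(strata):
    """Common odds ratio; each stratum is (ref_right, ref_wrong, foc_right, foc_wrong)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

def mh_delta(alpha):
    """Delta metric: negative values favor the reference group."""
    return -2.35 * math.log(alpha)

def classify_mh(delta, sig_above_0, sig_above_1):
    """A/B/C rule described above; significance flags are supplied by the caller."""
    if abs(delta) > 1.5 and sig_above_1:
        return "C"
    if abs(delta) < 1.0 or not sig_above_0:
        return "A"
    return "B"

# Hypothetical counts for three ability strata (not TIMSS data).
strata = [(30, 20, 20, 30), (40, 10, 30, 20), (45, 5, 40, 10)]
alpha = mh_alpha(strata)
delta = mh_delta(alpha)
category = classify_mh(delta, sig_above_0=True, sig_above_1=True)
```

With these counts the reference group has the higher odds of success in every stratum, so the common odds ratio exceeds 1 and the delta value is negative.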

Logistic regression 

Logistic regression (Swaminathan & Rogers, 1990) led to the development of a now very popular DIF detection
method (Clauser & Mazor, 1998). The logistic regression procedure involves two stages. In the first stage, total
test score is included in the regression equation. In the second stage, two more variables, group membership and
the group × score interaction, are added to the equation. The analysis consists in testing whether the inclusion
of these two variables yields a statistically significant improvement. If so, the item is said to exhibit DIF.



The absolute criterion used here is the one proposed by Gelin & Zumbo (2003) and Jodoin & Gierl (2001). An
item is considered to show severe DIF if the chi-square test associated with the second stage is statistically
significant and if the R-square difference between the two stages is higher than 0.07. An item is considered to
show moderate DIF if the chi-square test associated with the second stage is statistically significant and if the
R-square difference between the two stages is higher than 0.035 but less than 0.07. In all other cases, DIF is
considered trivial.
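
The decision rule can be written as a short Python sketch. The function name, the default significance level of .05 and the treatment of a difference of exactly 0.07 are assumptions made for illustration; fitting the two regression stages is left to whatever statistical package is at hand.

```python
def classify_lr(p_value, r2_stage1, r2_stage2, alpha=0.05):
    """Classify DIF from the stage-2 chi-square p-value and the R-square gain."""
    r2_diff = r2_stage2 - r2_stage1
    if p_value >= alpha:        # no significant gain from group and interaction terms
        return "A"
    if r2_diff > 0.07:          # severe DIF
        return "C"
    if r2_diff > 0.035:         # moderate DIF (a difference of exactly 0.07 lands here by assumption)
        return "B"
    return "A"                  # significant but negligible effect size

# Hypothetical results for four items.
examples = [
    classify_lr(0.001, 0.30, 0.38),
    classify_lr(0.001, 0.30, 0.35),
    classify_lr(0.001, 0.30, 0.31),
    classify_lr(0.200, 0.30, 0.38),
]
```

The third example shows why an effect-size threshold matters with large samples: a significant chi-square alone does not move an item out of category A.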



IRT-based DIF methods

The area method (UPD index) 

The area method (Shepard, Camilli & Williams, 1984; Camilli & Shepard, 1994) focuses on a quantity that reflects
the difference between the item characteristic curves (ICCs) of the reference group and the focal group. Two
indices were proposed to that end: a signed index (SPD-θ)² and an unsigned index (UPD-θ)³. If the two ICCs cross,
the probability differences involved in the computation of the signed index can cancel out, and the value of this
index can be low even if large or moderate DIF is manifestly present. Since in this study we want to detect
uniform as well as non-uniform DIF, we rely on the UPD index, whose values are always positive. The signed index
would be useful if we were interested in identifying which group was favored by the item. Notice that the sum of
the probability differences is taken over the subjects in the focal group ($n_F$).

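\[ \text{SPD-}\theta = \frac{1}{n_F} \sum_{j=1}^{n_F} \left[ P_{iR}(\theta_j) - P_{iF}(\theta_j) \right] \]

\[ \text{UPD-}\theta = \left( \frac{1}{n_F} \sum_{j=1}^{n_F} \left[ P_{iR}(\theta_j) - P_{iF}(\theta_j) \right]^2 \right)^{1/2} \]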

Since we know of no absolute criterion for the UPD index, we decided to use the value of .10 as a threshold:
this amounts to considering as DIF those items for which the overall difference in probabilities between the
focal group and the reference group is higher than .10.
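
A minimal Python sketch of the two indices follows. It assumes a three-parameter logistic ICC with the usual 1.7 scaling constant (the paper does not state which IRT model was fitted), and the item parameters and ability values are invented so that the two ICCs cross.

```python
import math

def icc(theta, a, b, c=0.0):
    """Three-parameter logistic ICC (assumed model); c = 0 gives the 2PL."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def spd_upd(thetas_focal, ref_params, foc_params):
    """Signed and unsigned probability differences over focal-group abilities."""
    diffs = [icc(t, *ref_params) - icc(t, *foc_params) for t in thetas_focal]
    n_f = len(diffs)
    spd = sum(diffs) / n_f
    upd = math.sqrt(sum(d * d for d in diffs) / n_f)
    return spd, upd

# Hypothetical item: same difficulty, different discrimination, so the ICCs cross at theta = 0.
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
spd, upd = spd_upd(thetas, ref_params=(1.5, 0.0), foc_params=(0.5, 0.0))
flagged = upd > 0.10
```

In this constructed case the signed differences cancel almost exactly while the unsigned index is well above the .10 threshold, which is the situation that motivates relying on UPD for non-uniform DIF.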



Raju’s NCDIF index 



Raju, van der Linden & Fleer (1995) proposed a very refined framework to look at DIF items. Following these
authors, two approaches are possible. The first one looks at items that can prevent valid comparisons between
the focal group and the reference group based on total score. The second one aims at identifying DIF items that
could offend subgroups (Blacks, girls, handicapped, etc.) of a population and that must be changed or else
completely removed from the item bank.

The first approach involves a differential test functioning index (DTF) and a compensatory or signed DIF index
(CDIF) for each item. Since the sum of the CDIF values of all items in the test is equal to the DTF value, the
procedure involves identifying and removing items (one at a time) with the largest positive CDIF values until
the DTF index is no longer statistically significant.

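It can be shown that

\[ \mathrm{DTF} = E_F\left(D_j^2\right) = \sigma_D^2 + \bar{D}^2 \]

\[ \mathrm{DTF} = \sum_i \mathrm{CDIF}_i \]

\[ \mathrm{CDIF}_i = E_F\left(d_{ij} D_j\right) = \sigma_{d_i D} + \bar{d}_i \bar{D} \]

where $d_{ij} = P_{iR}(\theta_j) - P_{iF}(\theta_j)$ and $D_j = V_R(\theta_j) - V_F(\theta_j)$ for item $i$ and
ability level $\theta_j$. Bars denote means, and $\sigma$ terms denote variances and covariances, taken over the
focal group.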
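
The additivity of the CDIF values can be made explicit with a short derivation. It assumes, as in Raju's framework, that the test-level difference is the sum of the item-level differences, $D_j = \sum_i d_{ij}$; this step is not spelled out in the text above.

```latex
\begin{align*}
D_j &= \sum_i d_{ij}, \\
\mathrm{DTF} &= E_F\left(D_j^2\right)
  = E_F\Big(D_j \sum_i d_{ij}\Big)
  = \sum_i E_F\left(d_{ij} D_j\right)
  = \sum_i \mathrm{CDIF}_i .
\end{align*}
```

An item whose differences run against the overall direction of $D_j$ gets a negative CDIF value and offsets the others, which is why the index is called compensatory.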



The second approach involves a non-compensatory (unsigned) DIF index (NCDIF) used in the same sense as the 
UPD index described above. 

The NCDIF index is given by the following formula:

\[ \mathrm{NCDIF}_i = E_F\left(d_{ij}^2\right) = \sigma_{d_i}^2 + \bar{d}_i^2 \]

where $d_{ij} = P_{iR}(\theta_j) - P_{iF}(\theta_j)$, $j = 1, 2, \ldots, n_F$, and $n_F$ refers to the number of
subjects in the focal group.

² Following Camilli & Shepard (1994, p. 67), this reads: signed probability difference controlling for theta.
³ Unsigned probability difference controlling for theta.

The chi-square test associated with this statistic (with $n_F$ degrees of freedom) is given by

\[ \chi^2 = \frac{n_F \, \mathrm{NCDIF}_i}{\sigma_{d_i}^2} \]



The NCDIF index is non-compensatory and unsigned, which means that its values are always positive. This index
tends to identify items for which the area between the two ICCs is large. In accordance with this method
(McCarty, Oshima & Raju, 2002; Raju, van der Linden & Fleer, 1995), and since the chi-square statistic is
influenced by sample size, an item is judged as presenting DIF if the value of NCDIF is higher than 0.006 and if
the chi-square value is statistically significant (at α = .01).
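
The indices of this section can be checked numerically with a small Python sketch. The matrix of probability differences is invented, the test-level difference is assumed to be the sum of the item-level differences, and only the magnitude part of the flagging rule (NCDIF above 0.006) is applied; the chi-square test is left out.

```python
import math
from statistics import mean, pvariance

def ncdif(d_i):
    """Mean squared probability difference over focal-group examinees."""
    return sum(d * d for d in d_i) / len(d_i)

def cdif(d_i, big_d):
    """Mean of d_ij * D_j over focal-group examinees."""
    return sum(d * D for d, D in zip(d_i, big_d)) / len(d_i)

# Hypothetical d_ij values: 3 items (rows) x 4 focal-group examinees (columns).
d = [
    [0.10, 0.08, 0.12, 0.10],
    [-0.02, 0.01, 0.00, 0.01],
    [0.05, -0.05, 0.05, -0.05],
]
big_d = [sum(col) for col in zip(*d)]          # assumed: D_j is the sum of the d_ij
dtf = sum(D * D for D in big_d) / len(big_d)   # mean of D_j squared

ncdif_values = [ncdif(row) for row in d]
cdif_values = [cdif(row, big_d) for row in d]
large = [v > 0.006 for v in ncdif_values]      # magnitude criterion only
```

In this example only the first item exceeds the 0.006 threshold. The third item has differences of constant size but alternating sign: its NCDIF reflects their magnitude even though they average to zero.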



Results in a Canadian (common culture) context 

Using the data from the Canadian School Achievement Indicators Program (SAIP), Boiteau, Bertrand, Frenette 
& Saint-Onge (2002) showed that in similar cultural contexts (French Canadians, English Canadians), IRT-based 
DIF procedures were somewhat more stable than non-IRT based procedures from one linguistic group (English) 
to another (French). They used samples of 20 000 students in each of the 1996 and 1999 SAIP science 
assessments. 

Table 1 shows that IRT-based methods (NCDIF, UPD) were found more stable than non-IRT-based methods
(MH, LR). Using the 29 items identified as DIF by at least one method, the stability rates of the IRT-based
methods were found to be higher than 75%; that is, more than 75% of the decisions (an item is considered DIF or
not) taken in the 1996 SAIP assessment were the same in the 1999 SAIP assessment.
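
Read this way, a within-method stability rate is simply the share of items that receive the same decision on both occasions. A minimal Python sketch, with invented decisions and a hypothetical function name, is:

```python
def stability_rate(flags_first, flags_second):
    """Share of items with the same DIF decision in both assessments."""
    if len(flags_first) != len(flags_second):
        raise ValueError("both assessments must cover the same items")
    same = sum(a == b for a, b in zip(flags_first, flags_second))
    return same / len(flags_first)

# Hypothetical decisions of one method (True = flagged as DIF) for 8 common items.
flags_1996 = [True, False, False, True, False, True, False, False]
flags_1999 = [True, False, True, True, False, False, False, False]
rate = stability_rate(flags_1996, flags_1999)
```

Here six of the eight decisions coincide, giving a rate of 75%. The rate depends on which items enter the denominator, so rates computed on items flagged by at least one method and rates computed on all common items are not directly comparable.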



Insert Table 1 about here 



Results obtained by Boiteau, Bertrand, Frenette & Saint-Onge (2002) have also shown that MH seemed to 
produce lower between-method agreement rates than LR, NCDIF or UPD. 

In another study involving more than 20 000 students in each of the 1997 and the 2001 SAIP math assessments, 
Boiteau & Bertrand (in press) concluded (table 2) that non IRT-based methods were somewhat less stable than 
the IRT-based methods. 



Insert Table 2 about here



Results in foreign (different culture) contexts



Within-method stability rates 



Table 3 summarizes the results obtained from the 1995 and the 1999 TIMSS assessments (booklets 1, 5, 7). It
can be seen that the overall within-method stability rates of non-IRT-based methods are very much the same as
those of IRT-based methods. These rates are quite high, as they are all close to 80%.



Insert Table 3 about here 



A look at table 4 (booklet 1), table 5 (booklet 5) and table 6 (booklet 7) shows that items with large DIF
(item 3, item 8, item 11) are detected by almost all methods in the three booklets.



Insert Tables 4, 5 and 6 about here



Between-method agreement rates 



As seen in table 7, the two IRT-based methods (NCDIF, UPD) got a very high agreement rate (92%) while the 
two non IRT-based methods (MH, LR) got a quite low agreement rate (67%). Overall, the two methods that had 
the highest agreement rate (95%) are MH and NCDIF. 
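
Between-method agreement rates of this kind can be computed pairwise from the DIF decisions of each method. The Python sketch below uses invented decisions for six items and assumes that agreement is the proportion of items on which two methods reach the same decision.

```python
from itertools import combinations

def agreement_rates(flags_by_method):
    """Proportion of identical DIF decisions for every pair of methods."""
    rates = {}
    for m1, m2 in combinations(flags_by_method, 2):
        f1, f2 = flags_by_method[m1], flags_by_method[m2]
        rates[(m1, m2)] = sum(a == b for a, b in zip(f1, f2)) / len(f1)
    return rates

# Hypothetical decisions (1 = flagged as DIF) for six items.
flags = {
    "MH":    [1, 0, 1, 0, 0, 1],
    "LR":    [1, 0, 0, 0, 1, 1],
    "NCDIF": [1, 0, 1, 0, 0, 1],
    "UPD":   [1, 0, 1, 1, 0, 1],
}
rates = agreement_rates(flags)
```

With four methods there are six pairs, which matches the layout of a lower-triangular agreement table.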



Insert Table 7 about here 



Analyzing each booklet separately for each assessment (tables 8 through 13) it can be seen that between-method 
agreement rates were generally high especially for the IRT-based methods. While MH produced agreement rates 
as high as NCDIF and UPD, the between-method agreement rates were not so good for LR, and especially those 
related to booklets 5 and 7. 

Insert Tables 8 to 13 about here



Discussion



The aim of this study was to compare within-method stability rates and between-method agreement rates of four 
DIF detection methods: two IRT-based methods, NCDIF and UPD, and two (low cost) non IRT-based methods, 
MH and LR. Data from USA and Japan samples from the 1995 and 1999 TIMSS math assessments were used 
for that purpose. 

While we found in other studies (Boiteau, Bertrand, Frenette & Saint-Onge, 2002; Boiteau & Bertrand, in press)
that IRT-based methods were somewhat more stable, this study shows that non-IRT-based methods are as stable
as IRT-based methods. Several reasons can explain the lack of convergence of these results. First, the former
studies were undertaken in similar cultural contexts while the present study compares two very different
cultures/countries, Japan and the USA. Second, the former studies involved far fewer large-DIF items: only about
20% of the items were then detected as DIF. In the present study, on the other hand, most of the 48 items that we
worked with were classified as moderate or severe DIF by at least one method. Also, we did not use internal
criterion (total score) purification (Navas-Ara & Gomez-Benito, 2002), which might have generated different
results, especially for IRT-based methods. Third, the fact that we got many large-DIF items in the present study
makes it easier for all methods to detect those items and therefore to reach high within-method stability rates
as well as high between-method agreement rates.

The lack of purification of the internal criterion could be considered a limitation of this study, but we should
keep in mind that

- results from table 2 and table 3, based on two different studies, are consistent;

- a preliminary study using a purified internal criterion (two-stage approach) for non-IRT-based
methods showed results consistent with those obtained here without a purified internal criterion;

- Gierl, Jodoin & Ackerman (2000) argued that purification may be unnecessary for methods like LR;

- Zenisky, Hambleton & Robin (2003) noted that although the purification of the internal criterion
is a well-known issue, most researchers don’t use it anyway;

- Navas-Ara & Gomez-Benito (2002), arguing for a purified internal criterion, mentioned that even
without purifying, results from MH were relevant: our results show that MH got very high agreement
rates with other methods, and especially with IRT-based methods.

Based on the present results, we can argue that MH got much higher between-method agreement rates than LR.
The UPD method also performed very well. This method has an interesting intuitive appeal: an item is said to be
DIF if the overall probability difference between the ICC of the focal group and the ICC of the reference group
is higher than .10. Overall, the methods used here got much better between-method agreement rates than those
observed in similar situations. Price (1999), for example, found an agreement rate of only⁴ 20% between NCDIF
and MH, also using tests translated from English to Japanese. Gierl, Rogers & Klinger (1999), however, found an
agreement rate of 90% between LR and MH using a Canadian math test (the reference group was English and the
focal group was French Immersion), while we found only a 67% agreement rate between those two methods. None of
these studies used purification of the internal criterion.



Some of our results are also consistent with other studies. For example, using logistic regression as a DIF
detection procedure, Ercikan (1999) found a little more than 18% of DIF items in math TIMSS items (1995); we
found, also using logistic regression, 9/24 (38%) DIF items in booklet 1, 5/24 (21%) DIF items in booklet 5 and
4/24 (17%) DIF items in booklet 7.



⁴ Price (1999) reported a 20% agreement rate, but our analysis of his data showed a 73% agreement rate using our
definition.

A final result is worth reporting: we found that all but one of the items flagged as an outlier/extreme value by
a box-and-whiskers plot (relative criterion) were also detected by Raju’s DTF-CDIF procedure. Our study also
showed that this relative criterion tends to detect far fewer DIF items when the DIF statistic involved
(UPD, NCDIF, |Δ|, χ²) had a large variance, that is, when the box width was large.

Many more studies are needed before «the best» detection method (powerful, low cost, low type I error rate,
high within-method stability rates, high between-method agreement rates) can be selected, whether IRT-based
or non-IRT-based. Among those, the usefulness of a purified internal criterion for the IRT-based methods should
be investigated thoroughly.

Figure 1 Box-and-whiskers plots of the NCDIF index for booklets 1, 5 and 7, using the 95 and 99 TIMSS
assessments, showing outlier (O) and extreme (♦) values
Tables



Table 1 Within-method stability rates of DIF methods using the 1996 and 1999 SAIP science assessments:
Mantel-Haenszel (MH), logistic regression (LR), NCDIF index and UPD index.

Table 2 Within-method stability rates of DIF methods using the 1997 and 2001 SAIP math assessments:
Mantel-Haenszel (MH), logistic regression (LR), NCDIF index and UPD index.

Table 3 Overall within-method stability rates of DIF methods using the 1995 and 1999 TIMSS math
assessments (booklets 1, 5 and 7): Mantel-Haenszel (MH), logistic regression (LR), NCDIF index and UPD
index.

Table 4 Within-method stability rates using the 1995 and 1999 TIMSS math assessments (booklet 1):
Mantel-Haenszel (MH), logistic regression (LR), NCDIF index and UPD index. Presented are category C
and category B items (X items are also DIF).

[X° : This item is also an outlier in a box-and-whiskers plot]

[26 (DTF99) : Item 26 was also detected as DIF in 99 by the DTF-CDIF framework]

Table 5 Within-method stability rates using the 1995 and 1999 TIMSS math assessments (booklet 5):
Mantel-Haenszel (MH), logistic regression (LR), NCDIF index and UPD index. Presented are category C
and category B items (X items are also DIF).

[38 (DTF95) : Item 38 was also detected as DIF in 95 by the DTF-CDIF framework]

Table 6 Within-method stability rates using the 1995 and 1999 TIMSS math assessments (booklet 7):
Mantel-Haenszel (MH), logistic regression (LR), NCDIF index and UPD index. Presented are category C
and category B items (X items are also DIF).

[X° : This item is also an outlier in a box-and-whiskers plot]

[23 (DTF95) : Item 23 was also detected as DIF in 95 by the DTF-CDIF framework]

Table 7 Overall between-method agreement rates in the 1995 and the 1999 TIMSS math assessments
(booklets 1, 5, 7): Mantel-Haenszel (MH), logistic regression (LR), NCDIF index and UPD index.

Table 8 Between-method agreement rates in the 1995 TIMSS math assessment (booklet 1):
Mantel-Haenszel (MH), logistic regression (LR), NCDIF index and UPD index.

Table 9 Between-method agreement rates in the 1999 TIMSS math assessment (booklet 1):
Mantel-Haenszel (MH), logistic regression (LR), NCDIF index and UPD index.

Table 10 Between-method agreement rates in the 1995 TIMSS math assessment (booklet 5):
Mantel-Haenszel (MH), logistic regression (LR), NCDIF index and UPD index.

Table 11 Between-method agreement rates in the 1999 TIMSS math assessment (booklet 5):
Mantel-Haenszel (MH), logistic regression (LR), NCDIF index and UPD index.

Table 12 Between-method agreement rates in the 1995 TIMSS math assessment (booklet 7):
Mantel-Haenszel (MH), logistic regression (LR), NCDIF index and UPD index.

Table 13 Between-method agreement rates in the 1999 TIMSS math assessment (booklet 7):
Mantel-Haenszel (MH), logistic regression (LR), NCDIF index and UPD index.